Abstract



We introduce an efficient computational framework for hashing data belonging to 
multiple modalities into a single representation space where they become mutually 
comparable. The proposed approach is based on a novel coupled Siamese neural 
network architecture and allows unified treatment of intra- and inter-modality similarity
learning. Unlike existing cross-modality similarity learning approaches, our
hashing functions are not limited to binarized linear projections and can assume 
arbitrarily complex forms. We show experimentally that our method significantly 
outperforms state-of-the-art hashing approaches on multimedia retrieval tasks. 



1 Introduction 



Similarity is a fundamental notion underlying a variety of computer vision, pattern recognition, and
machine learning tasks ranging from retrieval, ranking, classification, and clustering to object
detection, tracking, and registration. In all these problems, one has to quantify the degree of similarity
between objects usually represented as feature vectors. While in some cases domain-specific
knowledge dictates a natural similarity function, most generally a "natural" measure of similarity is rather
elusive and cannot be constructed without side information provided, e.g., through human annotation.
An even more challenging setting frequently arises in tasks involving multiple media or data coming 
from different modalities. For example, a medical image of the same organ can be obtained using 
different physical processes such as CT and MRI; a multimedia search engine may perform queries 
in a corpus consisting of audio, video, and textual information. While domain knowledge can be 
used to construct reasonable similarity functions for each data modality, it is much more challenging 
to create a consistent and meaningful similarity measure across them. 



Previous work The idea of constructing similarity measures suitable to specific data has been
thoroughly explored by the statistics and machine learning communities. One can roughly divide
similarity learning methods into unsupervised and supervised. The former class uses only the data
with no additional side information. Unsupervised methods include PCA and its kernelized version
(Schoelkopf et al. (1997)), which approximate the data globally by their second-order statistics either in
the original Euclidean space or in a feature space represented by a kernel; and various local
embedding methods such as locally linear embedding (Roweis & Saul (2000)), Laplacian eigenmaps
(Belkin & Niyogi (2003)), and diffusion maps (Coifman & Lafon (2006)), which are all based on
the assumption that data residing in a high-dimensional Euclidean space actually belong to a
low-dimensional manifold, a parametrization of which is sought. Unsupervised methods are
inherently limited due to their inability to incorporate side information into the learning process.

Supervised methods can be further subdivided according to the type of side information they rely on.
Class labels are the most straightforward way of specifying side information, and are used in methods
dating back to LDA (Johnson & Wichern (2002)) and its kernelized version (Mika et al. (1999)), as
well as in the more modern approaches of Xing et al. (2002) and Weinberger & Saul (2009). Other methods
accept side information in the form of knowingly similar and dissimilar pairs (Davis et al. (2007)) or
triplets of the form "x is more similar to y than z" (Shen et al. (2009); McFee & Lanckriet (2009)).



A family of methods referred to as multidimensional scaling (MDS) relies on metric dissimilarity
values supplied on a training set of pairs of data vectors, and seeks a Euclidean representation
reproducing them as faithfully as possible (Borg & Groenen (2005)). Once the embedding into the
representation space has been learned, similarity to new, unseen data is computed either directly (if
the metric admits a parametric representation), or using an out-of-sample extension.

Similarity learning methods can also be classified by the type of the produced similarity functions. A
significant class of practical methods learns a linear projection making the Euclidean metric optimal;
this is essentially equivalent to learning a Mahalanobis distance (Weinberger & Saul (2009); Shen
et al. (2009)). Kernelized versions of these approaches are often available for the cases where the
data have an intricate structure that cannot be captured by a linear transformation.

Hashing approaches represent the data as binary codes to which the Hamming metric is subsequently
applied as the measure of similarity. These methods include the family of locality sensitive
hashing (LSH) (Gionis et al. (1999)) and the recently introduced spectral hashing (Weiss et al. (2008)).
These approaches are mainly used to construct an efficient approximation of some trusted standard
similarity such as the Jaccard index or the cosine distance, and are inapplicable if side
information has to be relied upon. Shakhnarovich et al. (2003) proposed to construct optimal LSH-like
hashes (referred to as similarity-sensitive hashing or SSH) using supervised learning. More efficient
approaches have been subsequently proposed by Torralba et al. (2008) and Strecha et al. (2012).



The extension of similarity learning to multi-modal data has been addressed in the literature only
very recently. Bronstein et al. (2010) used a supervised learning algorithm based on boosting to
construct hash functions of data belonging to different modalities in a way that makes them
comparable using the Hamming metric. This method can be viewed as an extension of SSH to the
multimodal setting, dubbed by the authors cross-modal SSH (CM-SSH), and it enjoys the
compactness of the representation and the low complexity involved in distance computation. McFee &
Lanckriet (2011) proposed to learn multi-modal similarity using ideas from multiple kernel learning
(Bach et al. (2004); McFee & Lanckriet (2009)). Multi-modal kernel learning approaches have been
proposed by Lee et al. (2009) for medical image registration, and by Weston et al. (2010). The
main disadvantage of the latter is the fact that it is limited to linear projections only. The framework
proposed by McFee & Lanckriet (2011) can be kernelized, but it involves computationally expensive
semidefinite programming, which limits scalability. Also, both algorithms produce continuous
Mahalanobis metrics, which is disadvantageous in both computational and storage complexity,
especially when dealing with large-scale data. The appealing property of similarity-preserving hashing
methods like the CM-SSH of Bronstein et al. (2010) is the compactness of the representation and the
low complexity involved in distance computation.



Contributions This paper is motivated by the work of Bronstein et al. (2010) on multimodal
similarity-preserving hashing. We propose a novel multi-modal similarity learning framework based
on neural networks. Our approach has several advantages over the state-of-the-art. First, we combine
intra- and inter-modal similarity into a single framework. This allows exploiting richer information
about the data and can tolerate missing modalities; the latter is especially important in sensor
networks where one or more sensors may fail, or in applications like multimedia retrieval where it is hard
to obtain reliable samples of cross-modal similarity. We show that previous works can be considered
as particular cases of our model. Second, we solve the full optimization problem without resorting
to relaxations as in SSH-like methods; it has been recently shown that such a relaxation degrades
the hashing performance (see e.g., Strecha et al. (2012); Masci et al. (2011)). Third, we introduce a
novel coupled Siamese neural network architecture to solve the optimization problem underlying our
multi-modal hashing framework. Fourth, the use of neural networks can be very naturally
generalized to more complex non-linear projections using multi-layered networks, thus producing
embeddings of arbitrarily high complexity. We show experimental results on several standard
multi-modal datasets demonstrating that our approach compares favorably to state-of-the-art algorithms.



2 Background



Let $X \subseteq \mathbb{R}^{n}$ and $Y \subseteq \mathbb{R}^{n'}$ be two spaces representing data
belonging to different modalities (e.g., $X$ are images and $Y$ are text descriptions). Note that even
though we assume that the data can be represented in the Euclidean space, the similarity of the data is
not necessarily Euclidean and in general can be described by some metrics
$d_X : X \times X \to \mathbb{R}_+$ and $d_Y : Y \times Y \to \mathbb{R}_+$, to which
we refer as intra-modal dissimilarities. Furthermore, we assume that there exists some inter-modal
dissimilarity $d_{XY} : X \times Y \to \mathbb{R}_+$ quantifying the "distance" between points in
different modalities. The ensemble of intra- and inter-modal structures $d_X, d_Y, d_{XY}$ is not
necessarily a metric in the strict sense. In order to deal with these structures in a more convenient
way, we try to represent them in a common metric space. In particular, the choice of the Hamming space
offers significant advantages in the compact representation of the data as binary vectors and the
efficient computation of their similarity.

Multimodal similarity-preserving hashing is the problem of representing the data from different
modalities $X, Y$ in a common space $\mathbb{H}^m = \{\pm 1\}^m$ of $m$-dimensional binary vectors
with the Hamming metric $d_{\mathbb{H}^m}(a, b) = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} a_i b_i$
by means of two embeddings, $\xi : X \to \mathbb{H}^m$ and $\eta : Y \to \mathbb{H}^m$,
mapping similar points as close as possible to each other and dissimilar points as distant as possible
from each other, such that $d_{\mathbb{H}^m} \circ (\xi \times \xi) \approx d_X$,
$d_{\mathbb{H}^m} \circ (\eta \times \eta) \approx d_Y$, and
$d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$. In
a sense, the embeddings act as a metric coupling, trying to construct a single metric that preserves
the intra- and inter-modal similarities. A simplified setting of the multimodal hashing problem used
in Bronstein et al. (2010) is cross-modality similarity-preserving hashing, in which only the
inter-modal dissimilarity $d_{XY}$ is taken into consideration and $d_X, d_Y$ are ignored. To the best
of our knowledge, the full multimodal case has never been addressed before.
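
As a quick sanity check of the Hamming metric on $\{\pm 1\}^m$ written above, the following minimal Python sketch compares the inner-product form $\frac{m}{2} - \frac{1}{2}\sum_i a_i b_i$ with a direct count of differing coordinates. The function names are illustrative and not part of the original work.

```python
def hamming_pm1(a, b):
    """Hamming distance on {-1,+1}^m computed as m/2 - (1/2) * sum(a_i * b_i)."""
    m = len(a)
    # m - <a, b> is always even for +/-1 vectors, so integer division is exact.
    return (m - sum(ai * bi for ai, bi in zip(a, b))) // 2


def hamming_count(a, b):
    """Reference implementation: number of coordinates where a and b differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
print(hamming_pm1(a, b), hamming_count(a, b))
```

Both expressions agree because each matching coordinate contributes $+1$ and each mismatch contributes $-1$ to the inner product.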

For simplicity, in the following discussion we assume the side information given as the intra- and
inter-modal dissimilarities to be binary, $d_X, d_Y, d_{XY} \in \{0, 1\}$, i.e., a pair of points can
be either similar or dissimilar. This dissimilarity is usually unknown and hard to model; however, it
should be possible to sample $d_X, d_Y, d_{XY}$ on some subset of the data
$X' \subset X, Y' \subset Y$. This sample can be represented as sets of similar pairs of points
(positives) $\mathcal{P}_X = \{(x \in X', x' \in X') : d_X(x, x') = 0\}$,
$\mathcal{P}_Y = \{(y \in Y', y' \in Y') : d_Y(y, y') = 0\}$, and
$\mathcal{P}_{XY} = \{(x \in X', y \in Y') : d_{XY}(x, y) = 0\}$,
and likewise defined sets $\mathcal{N}_X$, $\mathcal{N}_Y$, and $\mathcal{N}_{XY}$ of dissimilar pairs
of points (negatives). In many practical applications such as image annotation or text-based image
search, it might be hard to get the inter-modal positive and negative pairs, but easy to get the
intra-modal ones.

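The problem of multimodal similarity-preserving hashing boils down to finding two embeddings
$\xi : X \to \mathbb{H}^m$ and $\eta : Y \to \mathbb{H}^m$ such that
$d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$, minimizing the aggregate of false
positive and false negative rates,

$$\min_{\xi, \eta} \; E\{d_{\mathbb{H}^m} \circ (\xi \times \xi) \mid \mathcal{P}_X\} + E\{d_{\mathbb{H}^m} \circ (\eta \times \eta) \mid \mathcal{P}_Y\} + E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{P}_{XY}\}$$
$$\qquad - E\{d_{\mathbb{H}^m} \circ (\xi \times \xi) \mid \mathcal{N}_X\} - E\{d_{\mathbb{H}^m} \circ (\eta \times \eta) \mid \mathcal{N}_Y\} - E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{N}_{XY}\}. \tag{1}$$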
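
To make the structure of objective (1) concrete, here is a minimal Python sketch that estimates each expectation by an empirical mean over a list of pairs. The toy hash functions and pair lists are hypothetical and only illustrate how the six terms are combined.

```python
def hamming(a, b):
    """Number of coordinates where two binary codes differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def mean_distance(pairs, f, g):
    """Empirical estimate of E{d_H(f(u), g(v))} over a list of (u, v) pairs."""
    return sum(hamming(f(u), g(v)) for u, v in pairs) / len(pairs)


def objective(xi, eta, PX, PY, PXY, NX, NY, NXY):
    """Positive-pair distances minus negative-pair distances, as in (1)."""
    pos = (mean_distance(PX, xi, xi) + mean_distance(PY, eta, eta)
           + mean_distance(PXY, xi, eta))
    neg = (mean_distance(NX, xi, xi) + mean_distance(NY, eta, eta)
           + mean_distance(NXY, xi, eta))
    return pos - neg


# Toy coordinate-wise sign hash, used for both modalities (an assumption).
toy_hash = lambda v: [1 if t > 0 else -1 for t in v]
positives = [((1.0, 2.0), (2.0, 1.0))]
negatives = [((1.0, 2.0), (-1.0, -2.0))]
print(objective(toy_hash, toy_hash, positives, positives, positives,
                negatives, negatives, negatives))
```

A lower value means that positives are, on average, closer in Hamming space than negatives.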

Cross-modality similarity sensitive hashing Bronstein et al. (2010) studied the particular case
of cross-modal similarity hashing (without incorporating intra-modality similarity), with linear
embeddings of the form $\xi(x) = \mathrm{sign}(Px + a)$ and $\eta(y) = \mathrm{sign}(Qy + b)$. Their
CM-SSH algorithm constructs the dimensions of $\xi$ and $\eta$ one-by-one using boosting. At each
iteration, one-dimensional embeddings $\xi_i(x) = \mathrm{sign}(p_i x + a_i)$ and
$\eta_i(y) = \mathrm{sign}(q_i y + b_i)$ are found using a two-stage scheme: first, the embeddings are
linearized as $\xi_i(x) \approx p_i x$ and $\eta_i(y) \approx q_i y$ and the resulting
objective is minimized to find the projection

$$\min_{p_i, q_i} \; E\{x^T p_i^T q_i y \mid \mathcal{P}_{XY}\} - E\{x^T p_i^T q_i y \mid \mathcal{N}_{XY}\}, \tag{2}$$

(here $p_i$ and $q_i$ are unit vectors representing the $i$th row of the matrices $P$ and $Q$,
respectively, and the expectations are weighted by per-sample weights adjusted by the boosting). With
such an approximation, the optimal projection directions $p_i$ and $q_i$ have closed-form expressions
using the SVD of the positive and negative covariance matrices. At the second stage, the thresholds
$a_i$ and $b_i$ are found by one-dimensional search.
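
The linear embeddings $\xi(x) = \mathrm{sign}(Px + a)$ and $\eta(y) = \mathrm{sign}(Qy + b)$ can be illustrated with a short Python sketch. The parameter values below are hypothetical toy numbers, not learned ones, and the treatment of $\mathrm{sign}(0)$ is an assumption.

```python
def sign(t):
    # sign(0) is mapped to +1 here; this tie-breaking convention is an assumption.
    return 1 if t >= 0 else -1


def linear_hash(M, bias, v):
    """Compute sign(M v + bias) coordinate by coordinate."""
    return [sign(sum(w * vj for w, vj in zip(row, v)) + b)
            for row, b in zip(M, bias)]


def hamming(h1, h2):
    return sum(u != w for u, w in zip(h1, h2))


# Toy parameters: modality X is 2-dimensional, modality Y is 3-dimensional, m = 2 bits.
P, a = [[1.0, 0.0], [0.0, 1.0]], [0.0, -1.0]
Q, b = [[1.0, 1.0, 0.0], [0.0, -1.0, 1.0]], [0.0, 0.0]

x = [2.0, 0.5]         # hashed with xi(x)  = sign(P x + a)
y = [1.0, 0.5, -2.0]   # hashed with eta(y) = sign(Q y + b)
hx, hy = linear_hash(P, a, x), linear_hash(Q, b, y)
print(hx, hy, hamming(hx, hy))
```

Although $x$ and $y$ live in spaces of different dimension, their codes have the same length $m$ and can be compared directly with the Hamming distance.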

This approach has several drawbacks. First, CM-SSH solves a particular setting of problem (1)
with $\mathcal{P}_{XY}, \mathcal{N}_{XY}$ only, thus ignoring the intra-modality similarities. Second,
the assumption of separability (treating each dimension separately) and the linearization of the
objective replace the original problem with a relaxed version, whose optimization produces suboptimal
solutions that tend to increase the hash sizes (or alternatively, for a fixed hash length $m$, the
method manifests inferior performance; see Masci et al. (2011)). Finally, this approximation is limited
to a relatively narrow class of linear embeddings that often do not capture well the structure of the data.



3 Multimodal NN hashing 



Our approach for multimodal hashing is related to supervised methods for dimensionality reduction
and in particular extends the framework of Schmidhuber & Prelinger (1993), Hadsell et al. (2006), and
Taylor et al. (2011), also known as the Siamese architecture. These methods learn a mapping onto
a usually low-dimensional feature space such that similar observations are mapped to nearby points
in the new manifold and dissimilar observations are pulled apart. In our simplest setting, the linear
embedding $\xi(x) = \mathrm{sign}(Px + a)$ is realized as a neural network with a single layer (where
$P$ represents the linear weights and $a$ is the bias) and a sign activation function (in practice, we
use a smooth approximation $\mathrm{sign}(x) \approx \tanh(\beta x)$). The parameters of the embedding
can be learned using the back-propagation algorithm (Werbos (1974)) minimizing the loss

$$\mathcal{L}_X = \frac{1}{2} \sum_{(x,x') \in \mathcal{P}_X} \|\xi(x) - \xi(x')\|_2^2 + \frac{1}{2} \sum_{(x,x') \in \mathcal{N}_X} \max\{0, m_X - \|\xi(x) - \xi(x')\|_2\}^2 \tag{3}$$

w.r.t. the network parameters $(P, a)$. Note that for binary vectors (when $\beta = \infty$), the squared
Euclidean distance in (3) is equivalent up to constants to the Hamming distance. The second term
in (3) is a hinge loss providing robustness to outliers and produces a mapping for which negatives
are pulled $m_X$ apart. The system is fed with pairs of samples which share the same parametrization
and for which a corresponding dissimilarity is known, 0 for positives and 1 for negatives (thus the
name Siamese network, i.e., two inputs and a common output vector).
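
A minimal Python sketch of the single-modality loss (3) follows, using the smooth embedding $\tanh(\beta(Px + a))$ in place of the sign. The helper names and the toy parameters are illustrative assumptions; this is not the authors' implementation.

```python
import math


def embed(P, a, x, beta=1.0):
    """Smooth surrogate xi(x) = tanh(beta * (P x + a)) of sign(P x + a)."""
    return [math.tanh(beta * (sum(p * xj for p, xj in zip(row, x)) + ai))
            for row, ai in zip(P, a)]


def dist(u, v):
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def siamese_loss(P, a, positives, negatives, margin, beta=1.0):
    """Loss (3): squared distance on positives, squared hinge on negatives."""
    loss = 0.0
    for x, x2 in positives:
        loss += 0.5 * dist(embed(P, a, x, beta), embed(P, a, x2, beta)) ** 2
    for x, x2 in negatives:
        gap = margin - dist(embed(P, a, x, beta), embed(P, a, x2, beta))
        loss += 0.5 * max(0.0, gap) ** 2
    return loss


# One-bit toy example: the negative pair is 2 apart, so a margin of 3 is violated by 1.
P, a = [[1.0]], [0.0]
print(siamese_loss(P, a, [([1.0], [1.0])], [([1.0], [-1.0])], margin=3.0, beta=50.0))
```

Both inputs of a pair pass through the same parameters $(P, a)$, which is what makes the network "Siamese".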



Coupled Siamese architecture In the multimodal setting, we have two embeddings $\xi$ and $\eta$, each
cast as a Siamese network with parameters $(P, a)$ and $(Q, b)$, respectively. Such an architecture
allows learning similarity-sensitive hashing for each modality independently by minimizing the loss
functions $\mathcal{L}_X$, $\mathcal{L}_Y$. In order to incorporate inter-modal similarity, we couple
the two Siamese networks by the cross-modal loss

$$\mathcal{L}_{XY} = \frac{1}{2} \sum_{(x,y) \in \mathcal{P}_{XY}} \|\xi(x) - \eta(y)\|_2^2 + \frac{1}{2} \sum_{(x,y) \in \mathcal{N}_{XY}} \max\{0, m_{XY} - \|\xi(x) - \eta(y)\|_2\}^2, \tag{4}$$

thus jointly learning two sets of parameters, one for each modality. We refer to this model, which
generalizes the Siamese framework, as coupled Siamese networks, for which a schematic representation is
shown in Figure 1.

Our implementation differs from the original architecture of Hadsell et al. (2006) in the choice of the
output activation function (we use tanh activation that encourages binary representations rather than
a linear output layer). This way the maximum distance is bounded by $\sqrt{4m}$, and by simply enlarging
the margin between dissimilar pairs we enforce the learning of codes which differ by the sign of
their components. Once the model is learned, hashes are produced by thresholding the output.

The reader should also note that, for binary vectors, the Hamming distance is equal to the squared
Euclidean distance up to a constant factor. Hence the loss function in Eq. (5), when
$\beta \to +\infty$, margins $= 0$ and $\alpha = 1$, coincides with Eq. (1).
However, for optimization reasons a margin needs to be added.
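
The relation between the two distances, and the bound on the maximum distance, can be verified in one line from the definition of the Hamming metric on $\{\pm 1\}^m$ given in Section 2. The derivation below only uses $a_i^2 = b_i^2 = 1$.

```latex
% For a, b in {-1,+1}^m each coordinate difference is either 0 or +/-2.
\|a - b\|_2^2 = \sum_{i=1}^{m} (a_i - b_i)^2
             = 2m - 2\sum_{i=1}^{m} a_i b_i
             = 4\left(\frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} a_i b_i\right)
             = 4\, d_{\mathbb{H}^m}(a, b),
\qquad
\|a - b\|_2 \le \sqrt{4m} = 2\sqrt{m}.
```

The bound is attained when $b = -a$; for tanh outputs in $(-1, 1)^m$ it is approached but not reached.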



Figure 1: Schematic representation of our coupled Siamese network framework. Two pairs of nets,
one for modality X and one for modality Y, are coupled together by a cross-modal loss. The system
learns two sets of parameters.

Training The training of our coupled Siamese network is performed by minimizing

$$\min_{P, a, Q, b} \; \mathcal{L}_{XY} + \alpha_X \mathcal{L}_X + \alpha_Y \mathcal{L}_Y, \tag{5}$$

where $\alpha_X, \alpha_Y$ are weights determining the relative importance of each modality. The loss (5)
can be considered as a generalization of the loss in (1), which is obtained by setting
$\alpha_X = \alpha_Y = 1$, margins $= 0$, and $\beta = \infty$. Furthermore, setting
$\alpha_X = \alpha_Y = 0$, we obtain the particular setting of cross-modal loss, whose relaxed version
is minimized by the CM-SSH algorithm of Bronstein et al. (2010). It is also worth repeating that in many
practical cases, it is very hard to obtain reliable cross-modal training samples
$(\mathcal{P}_{XY}, \mathcal{N}_{XY})$ but much easier to obtain intra-modal samples
$(\mathcal{P}_X, \mathcal{N}_X, \mathcal{P}_Y, \mathcal{N}_Y)$. In the full multimodal setting
($\alpha_X, \alpha_Y > 0$), the terms $\mathcal{L}_X, \mathcal{L}_Y$ can be considered as a
regularization, preventing the algorithm from overfitting.
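
The weighted combination in (5) can be sketched in a few lines of Python operating on already-embedded code pairs. The dictionary layout and function names are assumptions made for illustration only.

```python
import math


def pair_loss(pos, neg, margin):
    """Contrastive loss over already-embedded (u, v) code pairs."""
    def dist(u, v):
        return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))
    attract = sum(0.5 * dist(u, v) ** 2 for u, v in pos)
    repel = sum(0.5 * max(0.0, margin - dist(u, v)) ** 2 for u, v in neg)
    return attract + repel


def coupled_objective(codes, margins, alpha_x, alpha_y):
    """L_XY + alpha_X * L_X + alpha_Y * L_Y.

    codes[k] = (positive_pairs, negative_pairs) for k in {"X", "Y", "XY"}.
    """
    l_x = pair_loss(*codes["X"], margins["X"])
    l_y = pair_loss(*codes["Y"], margins["Y"])
    l_xy = pair_loss(*codes["XY"], margins["XY"])
    return l_xy + alpha_x * l_x + alpha_y * l_y


codes = {
    "X": ([([1, 1], [1, -1])], []),
    "Y": ([], [([1, 1], [1, 1])]),
    "XY": ([([1, -1], [1, -1])], [([1, 1], [-1, -1])]),
}
margins = {"X": 1.0, "Y": 1.0, "XY": 3.0}
print(coupled_objective(codes, margins, alpha_x=0.1, alpha_y=0.3))
print(coupled_objective(codes, margins, alpha_x=0.0, alpha_y=0.0))  # cross-modal term only
```

Setting both weights to zero leaves only the cross-modal term, which corresponds to the CM-NN variant evaluated in the experiments.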



We apply the back-propagation algorithm (Werbos (1974); LeCun (1985); Rumelhart et al. (1986))
to get the gradient of our model w.r.t. the embedding parameters. The gradients of the intra- and
inter-modal loss functions w.r.t. the parameters of $\xi$ are given by

$$\nabla \mathcal{L}_X = \begin{cases}
(\xi(x) - \xi(x'))\,(\nabla\xi(x) - \nabla\xi(x')) & (x, x') \in \mathcal{P}_X \\
(\xi(x) - \xi(x') - m_X)\,(\nabla\xi(x) - \nabla\xi(x')) & (x, x') \in \mathcal{N}_X \text{ and } m_X > \|\xi(x) - \xi(x')\|_2 \\
0 & \text{else}
\end{cases}$$

$$\nabla \mathcal{L}_{XY} = \begin{cases}
(\xi(x) - \eta(y))\,\nabla\xi(x) & (x, y) \in \mathcal{P}_{XY} \\
(\xi(x) - \eta(y) - m_{XY})\,\nabla\xi(x) & (x, y) \in \mathcal{N}_{XY} \text{ and } m_{XY} > \|\xi(x) - \eta(y)\|_2 \\
0 & \text{else}
\end{cases}$$

where the term $\nabla\xi = \partial\xi / \partial(P, a)$ is the usual back-propagation step of a neural
network. An equivalent derivation is done for the parameters of $\eta$. The model can be easily learned
jointly using any gradient-based technique such as conjugate gradient or stochastic gradient descent.
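
As an illustration of the back-propagation step, the sketch below works out the positive-pair case only for a single-layer embedding $\xi(x) = \tanh(\beta(Px + a))$ and applies one stochastic gradient step. The chain-rule expression uses $\frac{d}{dz}\tanh(\beta z) = \beta(1 - \tanh^2(\beta z))$; all names and toy values are assumptions, and negative pairs are deliberately left out.

```python
import math


def embed(P, a, x, beta=1.0):
    """xi(x) = tanh(beta * (P x + a))."""
    return [math.tanh(beta * (sum(p * xj for p, xj in zip(row, x)) + ai))
            for row, ai in zip(P, a)]


def positive_loss(P, a, x, x2, beta=1.0):
    """0.5 * ||xi(x) - xi(x')||^2 for one positive pair."""
    u, v = embed(P, a, x, beta), embed(P, a, x2, beta)
    return 0.5 * sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def positive_grad(P, a, x, x2, beta=1.0):
    """Analytic gradient of positive_loss w.r.t. (P, a) via the chain rule."""
    u, v = embed(P, a, x, beta), embed(P, a, x2, beta)
    gP, ga = [], []
    for ui, vi in zip(u, v):
        d = ui - vi
        du = beta * (1.0 - ui * ui)  # tanh derivative at the first input
        dv = beta * (1.0 - vi * vi)  # tanh derivative at the second input
        gP.append([d * (du * xj - dv * x2j) for xj, x2j in zip(x, x2)])
        ga.append(d * (du - dv))
    return gP, ga


def sgd_step(P, a, x, x2, lr=0.01, beta=1.0):
    """One plain gradient-descent update on a single positive pair."""
    gP, ga = positive_grad(P, a, x, x2, beta)
    P_new = [[p - lr * g for p, g in zip(row, grow)] for row, grow in zip(P, gP)]
    a_new = [ai - lr * g for ai, g in zip(a, ga)]
    return P_new, a_new


P, a = [[0.5, -0.3], [0.2, 0.8]], [0.1, -0.2]
x, x2 = [1.0, 2.0], [0.5, -1.0]
before = positive_loss(P, a, x, x2)
P, a = sgd_step(P, a, x, x2)
print(before, positive_loss(P, a, x, x2))
```

Because both branches share $(P, a)$, the gradient is the difference of the two branch contributions, matching the factor $(\nabla\xi(x) - \nabla\xi(x'))$ in the intra-modal case.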



Non-linear embeddings Our model straightforwardly generalizes to non-linear embeddings using a
multi-layered network architecture. The proposed framework is in fact general, and any class of
neural networks can be applied to arbitrarily increase the complexity of the embedding. Deep and
hierarchical models are able to model highly non-linear embeddings and scale well to large-scale
data by means of fully online learning, where the parameters are updated after every input tuple
presentation. This allows sampling a possibly huge space with constant memory requirements.
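
A two-layer embedding can be sketched as follows. The choice of tanh for the hidden layer and the toy weights are assumptions for illustration; only the tanh output activation and the final thresholding step are stated in the text.

```python
import math


def dense(W, b, x):
    """Affine map W x + b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def two_layer_embed(params, x, beta=1.0):
    """Hidden layer followed by a tanh output layer with values in (-1, 1)."""
    W1, b1, W2, b2 = params
    h = [math.tanh(t) for t in dense(W1, b1, x)]  # hidden activation: an assumption
    return [math.tanh(beta * t) for t in dense(W2, b2, h)]


def to_hash(code):
    """Threshold the real-valued network output into a +/-1 hash."""
    return [1 if c >= 0 else -1 for c in code]


# Toy parameters: 2 inputs, 2 hidden units, 3 output bits.
params = (
    [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0],
    [[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]], [0.0, 0.0, 0.0],
)
print(to_hash(two_layer_embed(params, [2.0, 1.0])))
```

The same losses apply unchanged; only the mapping from input to code becomes non-linear.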



4 Results 



Figure 2: Mean average precision (mAP) vs. hash length m for the ShapeGoogle retrieval
experiment. Left: cross-modal (BoF-SS-BoF); center: BoF; right: SS-BoF. L1 and L2 refer to single- and
two-layer neural networks, respectively. Performance of raw descriptors in each modality with the $L_2$
distance is shown as dotted lines.

We tested our algorithm on cross-modal data retrieval tasks using standard datasets from the shape
retrieval and multimedia retrieval communities. We compared three algorithms: our coupled Siamese
framework in the full multimodal setting (MM-NN) and its reduced version (CM-NN), as well as
CM-SSH. The single-layer version (denoted L1) of CM-NN and MM-NN realizes a linear
embedding function and compares directly with CM-SSH. The two-layered version (L2) allows more
complex non-linear embeddings. For training the neural networks, we used conjugate gradients.
The hash functions learned by each of the methods were applied to the data in the datasets, and the
Hamming distance was used to rank the matches. Retrieval performance was evaluated using mean
average precision $\mathrm{mAP} = \sum_{r=1}^{R} P(r) \cdot \mathrm{rel}(r)$, where $\mathrm{rel}(r)$
is the relevance of a given rank (one if relevant and zero otherwise), $R$ is the number of retrieved
results, and $P(r)$ is precision at $r$, defined as the percentage of relevant results in the first $r$
top-ranked retrieved matches.
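
The evaluation protocol described above (rank by Hamming distance, then accumulate $P(r) \cdot \mathrm{rel}(r)$) can be sketched in Python as follows. Dividing by the number of relevant results and averaging over queries are common conventions assumed here; they are not spelled out in the formula above.

```python
def rank_by_hamming(query, database):
    """Indices of database codes sorted by Hamming distance to the query.

    Python's sort is stable, so ties keep their original index order.
    """
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


def average_precision(relevance):
    """sum_r P(r) * rel(r), normalized by the number of relevant results (assumed)."""
    hits, total = 0, 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / r  # precision at rank r
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of per-query average precision."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


query = [1, 1, -1, 1]
database = [[1, 1, -1, 1], [-1, -1, 1, -1], [1, -1, -1, 1]]
is_relevant = [True, False, True]
order = rank_by_hamming(query, database)
print(order, average_precision([is_relevant[i] for i in order]))
```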



Shape Google In the first experiment, we reproduced the multimodal shape retrieval experiment
of Bronstein et al. (2010) using the ShapeGoogle dataset (Bronstein et al. (2011)), containing 583
geometric shapes of 12 different classes subjected to synthetic transformations as well as 456
unrelated shapes ("distractors"). The goal was to correctly match a transformed shape from the query
set to the rest of the dataset. The shapes were represented using 32-dimensional bags of geometric
features (BoF) and 64-dimensional spatially-sensitive bags of geometric features (SS-BoF). To learn
the hashing functions, we used positive and negative sets of size $|\mathcal{P}| = 10^4$ and
$|\mathcal{N}| = 5 \times 10^4$, respectively. For CM-SSH, we used the code with settings provided by
Bronstein et al. (2010). For MM-NN, we used a single-layer architecture with margins
$m_X = m_Y = 1$, $m_{XY} = 3$ and $\alpha_X = 0.1$, $\alpha_Y = 0.3$, which we empirically found to be
the best combination (additional results with different parameters are shown in Table 3 and in
supplementary materials). In addition, we also show a two-layer architecture with 128 hidden nodes.
For CM-NN we used $m_{XY} = 3$ as for the single-layer case. MM-NN used $m_X = m_Y = 1$,
$m_{XY} = 3$ and $\alpha_X = 0.3$, $\alpha_Y = 0.3$.

Figure 2 and Table 1 show the performance of different methods as a function of hash length m. First,
we can see that NN-based methods (CM-NN and MM-NN) dramatically outperform the boosting-based
CM-SSH for a fixed hash length. MM-NN achieves almost perfect performance using only 12
bits (for comparison, CM-SSH requires almost 100 bits to achieve similar performance). The reason
is likely to be the fact that CM-SSH resorts to relaxation of the problem, thus producing a suboptimal
solution, while NN-hash solves the "true" optimization problem. Secondly, by adding another layer to
the neural network we obtain a non-linear hashing function, which performs dramatically better than
a single-layered architecture, achieving near-perfect performance with 8 bits. Thirdly, the fully
multimodal method (MM-NN) consistently outperforms the cross-modal version (CM-NN). We attribute
this fact to the use of the intra-modal losses, acting as regularization.

Table 1: Performance of different methods (mAP) on the ShapeGoogle retrieval experiment. L1
and L2 refer to neural networks with 1 or 2 layers.

Figure 3 visually exemplifies a retrieval experiment for MM-NNhash where the query shape on
the left-most side is compared with a similar shape (middle) and a dissimilar one (right-most). The
produced hash vectors on the bottom row are shown along with the BoF and the SS-BoF descriptors.




Figure 3: ShapeGoogle retrieval example. Top: original shapes, middle: BoF descriptor (left) and 
SS-BoF (right), bottom: MM-NNhash binary descriptors. 

The importance of intra-modal regularization is exemplified in Table 2. In this experiment, we
performed training of a single-layer net using a subset of the cross-modal data
$(\mathcal{P}_{XY}, \mathcal{N}_{XY})$, while keeping the intra-modal data in the MM-NN. The CM-NN
method manifested a significant performance drop (attributed most likely to overfitting), while the
performance of MM-NN remained practically unchanged.

Table 2: Performance of different methods (mAP) on the ShapeGoogle retrieval experiment where
only a subset of the cross-modal correspondences is kept. A hash of length m = 16 is used and the
best configuration is found through grid search.



Choice of parameters The theoretically smallest hash length must be
$m = \lceil \log_2(\#\text{classes}) \rceil$.
However, since we are using a simple embedding, in practice an m about 5-10 times larger may be
required to achieve satisfactory results. Table 3 and Figure 4 show the performance of the NNhash
methods under different choices of the parameters. We can see that the addition of intra-modal
regularization makes the cross-modal performance less sensitive to the choice of the parameters, and
that MM-NN produces higher cross-modal performance than CM-NN for similar margin settings.
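
The lower bound $\lceil \log_2(\#\text{classes}) \rceil$ is easy to evaluate; the short Python sketch below does so with integer arithmetic to avoid floating-point rounding. The function name is illustrative.

```python
def min_hash_length(num_classes):
    """Smallest m with 2**m >= num_classes, i.e. ceil(log2(num_classes))."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    return (num_classes - 1).bit_length()


m = min_hash_length(12)  # ShapeGoogle has 12 shape classes
print(m, 5 * m, 10 * m)  # bound, and the 5-10x range mentioned above
```

For the 12 ShapeGoogle classes this gives 4 bits, so the practical range suggested by the 5-10 times rule of thumb is roughly 20 to 40 bits.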




Figure 4: mAP for the ShapeGoogle experiment with various configurations of the hyper-parameters
of the system.

Table 3: Performance of different methods in the ShapeGoogle cross-modal retrieval experiment
using several hash lengths and various selections of the parameters for single-layer nets. The settings
with $\alpha_X = \alpha_Y = m_X = m_Y = 0$ correspond to CM-NN.



NUS In the second experiment, we used the NUS dataset of Chua et al. (2009), containing about
250K annotated images from Flickr. The images are manually categorized into 81 classes (one
image can belong to more than a single class) and represented as 500-dimensional bags of SIFT
features (BoF, used as the first modality) and 1000-dimensional bags of text tags (Tags, used as the
second modality). The dataset was split into approximately equal parts for testing and training. We
used positive and negative sets of size $|\mathcal{P}| = |\mathcal{N}| = 5 \times 10^5$. Positive pairs
were images belonging to at least one common class; negative pairs were images belonging to disjoint
sets of classes. For MM-NN, we used the margins $m_{XY} = 7$, $m_X = m_Y = 3$ and
$\alpha_X = \alpha_Y = 0.3$. CM-NN used $m_{XY} = 7$. Testing was performed using query and database
sets of size approximately $10^4$ and $1.8 \times 10^5$, respectively. The first ten matches were found
using approximate nearest neighbors (Arya et al. (1998)). Matches that had at least one class in common
with the query were considered correct.
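
The pair-selection rule for multi-label data (positives share at least one class, negatives have disjoint class sets) can be sketched as below. The rejection-sampling scheme, the seed, and the `max_tries` cap are assumptions made for this illustration; the text does not describe how pairs were actually drawn.

```python
import random


def sample_pairs(labels, num_pos, num_neg, seed=0, max_tries=100000):
    """Sample index pairs from items described by sets of class ids.

    labels: list of sets, labels[i] is the set of classes of item i.
    Returns (positives, negatives) as lists of (i, j) index pairs.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    tries = 0
    while (len(positives) < num_pos or len(negatives) < num_neg) and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(labels)), 2)
        if labels[i] & labels[j]:          # at least one common class
            if len(positives) < num_pos:
                positives.append((i, j))
        elif len(negatives) < num_neg:     # disjoint class sets
            negatives.append((i, j))
    return positives, negatives


labels = [{0, 1}, {1}, {2}, {2, 3}, {4}]
pos, neg = sample_pairs(labels, num_pos=3, num_neg=3)
print(pos, neg)
```

The same rule extends to cross-modal pairs by drawing the first index from one modality and the second from the other.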



Table 4 compares the performance of different methods. MM-NN outperforms other approaches in
all quality criteria. Figure 5 shows examples of top matches using MM-NN.

79.39%   75.62%   87.41%   86.23%
CM-SSH   53.7%    50.2%    54.0%    76.0%

Table 4: mAP of different methods for the NUS experiment. Hashes of length 64 were used.

Figure 6 shows retrieval results using as queries artificially created Tag vectors containing specific
words such as "cloud". These tags are hashed using $\eta$ and matched to BoFs hashed using $\xi$. The
retrieved results are meaningful and most of them belong to the same class. It is especially interesting
to note that MM-NN produced relevant results in both cases, irrespective of the ground-truth labels,
which are in general noisy. Figure 7 shows image annotation results. We retrieve the top five Tags
matches from a BoF query and assign the ten most frequent annotations to the image. We clearly
see that MM-NN produces better annotations than CM-SSH also in this case.



Wiki In the third experiment, we reproduced the results of Rasiwasia et al. (2010) using the dataset
of 2866 annotated images from Wikipedia. The images are categorized in 10 classes and represented
as 128-dimensional bags of SIFT features (Image modality) and 10-dimensional LDA topic model
(Text modality). The dataset was split into disjoint subsets of 2173 and 693 for training and testing,
respectively. We used positive and negative sets of size $|\mathcal{P}| = 1 \times 10^4$,
$|\mathcal{N}| = 1 \times 10^5$. Table 5 shows the mAP for the Image-Text and Text-Image cross-modal
retrieval experiment. For reference, we also reproduce the results reported in Rasiwasia et al. (2010)
using correlation matching (CM), semantic matching (SM), and semantic correlation matching (SCM).
MM-NN slightly outperforms SCM on average. We should stress however that these results are not directly
comparable with ours: while Rasiwasia et al. (2010) find a Euclidean embedding, we use Hamming embedding
(in general, a more difficult problem). While having similar performance to SCM, the significant
advantage of our approach is that it produces much smaller compact binary codes (at least 10 times
smaller) that can be searched very efficiently.

Figure 5: Top 5 matches for MM-NNhash in BoF-Tags (rows 1-5) and Tags-BoF (rows 6-10) cross-modal
retrieval. Rows 1, 6: classes; rows 2, 7: images; rows 3, 8: tags (subdued implies the modality
was not used); rows 4, 9: resulting hashes; rows 5, 10: Hamming distance from query.

Figure 6: Example of text-based image retrieval on NUS dataset using multimodal hashing. Shown
are top five image matches produced by CM-SSH (top) and MM-NN (bottom) in response to two
different queries: cloud (left) and red+food (right). Relevant matches are shown in green.

California explore night dog australia beach boston dogs dusk evening
dog animal dogs pet pets animals nature belgium blue bravo

explore flower beautiful black colorful flowers green interestingness light nature
clouds sun sunset beach rocks sky colors dance dock exposure

Figure 7: Example of image annotation on NUS dataset using multimodal hashing. Shown are tags
returned for image query using CM-SSH (top) and MM-NN (bottom). Groundtruth tags are shown
in green; synonyms are italicized.



Method      Image-Text   Text-Image   Average
MM-NN L1    27.8%        21.2%        24.5%
MM-NN L2    28.5%        22.0%        25.3%
CM-NN L1    26.7%        20.9%        23.8%
CM-NN L2    27.1%        21.1%        24.1%
CM-SSH      22.2%        18.4%        20.3%
CM*         24.9%        19.6%        22.3%
SM*         22.5%        22.3%        22.4%
SCM*        27.7%        22.6%        25.2%

Table 5: Performance of different methods in the cross-modal retrieval experiments (Image and
Text modalities) on the Wiki dataset. 32-bit hashes were used. Results marked with * are from
Rasiwasia et al. (2010) based on Euclidean embedding; these results are not directly comparable
with hashing and are included here for reference only.



5 Conclusions 

We introduced a novel learning framework for multimodal similarity-preserving hashing based on
the coupled Siamese neural network architecture. Unlike existing cross-modal similarity learning
methods, our approach does not assume linear projections; in fact, by increasing the number of
layers in the network, mappings of arbitrary complexity can be trained (our experiments showed that
using a multilayer architecture results in a significant improvement of performance). We also solve
the exact optimization problem during training, making no approximations like the boosting-based
CM-SSH. Our method does not involve semidefinite programming, and is scalable to a very large
number of dimensions and training samples. Experimental results on standard multimedia retrieval
datasets showed performance superior to state-of-the-art hashing approaches.